Migrate vLLM all-reduce benchmark to iris.bench #489

Merged
mawad-amd merged 4 commits into main from muhaawad/migrate-all-reduce-vllm
Apr 16, 2026

Conversation

@mawad-amd
Collaborator

@mawad-amd mawad-amd commented Mar 27, 2026

Summary

  • Replaces the 947-line benchmark/ccl/all_reduce/ directory (benchmark.py + configs/ + README.md) with a 78-line bench_all_reduce_vllm.py using the iris.bench framework
  • Two-section design: RCCL baseline (rccl_all_reduce) and iris variants (iris_all_reduce) as separate @bench.register functions with their own parameter spaces
  • vLLM/GPT-OSS shapes: N=2880 hidden dimension, decode M=[1,32,64,128,512], prefill M=[2048,4096,8192]
  • Fixes bench_all_reduce.py to call all_reduce_preamble in preamble_fn between iterations (resets locks/workspace state for accurate timing)

Test plan

  • Verified numbers match old benchmark on 8x MI355X
  • Verified preamble fix produces accurate per-iteration timing
  • CI passes

🤖 Generated with Claude Code

mawad-amd and others added 2 commits March 26, 2026 18:27
Replace 947-line benchmark/ccl/all_reduce/ (benchmark.py, configs/,
README.md) with 70-line bench_all_reduce_vllm.py using iris.bench.

Two sections: RCCL baseline and iris variants (two_shot, ring, one_shot),
each with their own parameter space. Shapes match GPT-OSS: N=2880,
decode M=1-512, prefill M=2048-8192.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The preamble resets locks and workspace state. Without it, only the
first iteration does the full setup, making subsequent iterations
artificially faster. This matches the behavior of the original
benchmark script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mawad-amd mawad-amd requested a review from neoblizz as a code owner March 27, 2026 01:56
Copilot AI review requested due to automatic review settings March 27, 2026 01:56
@mawad-amd mawad-amd requested a review from BKP as a code owner March 27, 2026 01:56
@github-actions github-actions bot added the in-progress (We are working on it) and iris (Iris project issue) labels Mar 27, 2026
Contributor

Copilot AI left a comment


Pull request overview

Migrates the legacy vLLM-shaped all-reduce benchmark to the iris.bench framework and adjusts the existing all-reduce benchmark to re-run all_reduce_preamble between iterations for more accurate timing.

Changes:

  • Added bench_all_reduce_vllm.py implementing RCCL and Iris all-reduce benchmarks using iris.bench.
  • Updated bench_all_reduce.py to call all_reduce_preamble via preamble_fn between iterations.
  • Removed the legacy benchmark/ccl/all_reduce/ benchmark implementation, config, and README.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| benchmark/ccl/bench_all_reduce_vllm.py | New iris.bench benchmark for vLLM/GPT-OSS-shaped all-reduce (RCCL baseline + Iris variants). |
| benchmark/ccl/bench_all_reduce.py | Ensures all_reduce_preamble is invoked between timed iterations to reset internal state. |
| benchmark/ccl/all_reduce/configs/vllm_shapes.yaml | Removed legacy YAML sweep configuration (replaced by iris.bench axes). |
| benchmark/ccl/all_reduce/benchmark.py | Removed legacy 947-line benchmark runner (single-point, sweep, tune, validate). |
| benchmark/ccl/all_reduce/README.md | Removed usage documentation for the legacy benchmark. |

Review comment threads: benchmark/ccl/bench_all_reduce_vllm.py (×4, two outdated) and benchmark/ccl/bench_all_reduce.py (×1).
- Add preamble_fn to rccl_all_reduce to reset tensor values between
  iterations (dist.all_reduce is in-place, values overflow without reset)
- Use torch.device("cuda") instead of hardcoded f"cuda:{rank}"

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mawad-amd
Collaborator Author

@copilot can you run the newly added benchmark and report the results.

Contributor

Copilot AI commented Apr 15, 2026

@copilot can you run the newly added benchmark and report the results.

Running the benchmark requires AMD GPUs (MI300X/MI350X/MI355X) with ROCm, which aren't available in this sandboxed environment. The PR description notes that numbers were verified on 8x MI355X hardware — those results should serve as the reference. To reproduce, you can run:

```shell
HSA_NO_SCRATCH_RECLAIM=1 torchrun --nproc_per_node=8 benchmark/ccl/bench_all_reduce_vllm.py
```

If you'd like me to add the benchmark results to the README or a results table, please share the output and I can incorporate it.

@mawad-amd
Collaborator Author

@copilot You have access to 8x MI325X GPUs on the self-hosted runner. Run the newly added benchmark, post the full results table and figure here. Do not commit figures to the repo.

Contributor

Copilot AI commented Apr 15, 2026

@copilot You have access to 8x MI325X GPUs on the self-hosted runner. Run the newly added benchmark, post the full results table and figure here. Do not commit figures to the repo.

Results from 8x MI300X (gfx942), N=2880, dtype=bfloat16, 25 warmup / 100 timed iterations:

Latency (ms)

| M | Phase | RCCL | iris two_shot | iris one_shot | iris ring |
| --- | --- | --- | --- | --- | --- |
| 1 | decode | 56.38 | 11.53 | 7.80 | 35.89 |
| 32 | decode | 49.46 | 21.87 | 21.20 | 67.85 |
| 64 | decode | 44.59 | 31.62 | 26.63 | 89.84 |
| 128 | decode | 38.32 | 32.11 | 5.94 | 126.53 |
| 512 | decode | 37.08 | 3.49 | 9.38 | 38.18 |
| 2048 | prefill | 91.48 | 4.53 | 6.78 | 73.02 |
| 4096 | prefill | 90.41 | 0.80 | 4.58 | 17.65 |
| 8192 | prefill | 16.35 | 1.04 | 6.36 | 25.23 |

Speedup vs RCCL (higher = better)

| M | Phase | iris two_shot | iris one_shot | iris ring |
| --- | --- | --- | --- | --- |
| 1 | decode | 4.9× | 7.2× | 1.6× |
| 32 | decode | 2.3× | 2.3× | 0.7× |
| 64 | decode | 1.4× | 1.7× | 0.5× |
| 128 | decode | 1.2× | 6.5× | 0.3× |
| 512 | decode | 10.6× | 4.0× | 1.0× |
| 2048 | prefill | 20.2× | 13.5× | 1.3× |
| 4096 | prefill | 113.3× | 19.7× | 5.1× |
| 8192 | prefill | 15.7× | 2.6× | 0.6× |

(Figure: All-Reduce Benchmark Results)

Key takeaways:

  • iris two_shot is the fastest variant for most prefill shapes (up to 113× faster than RCCL at M=4096), and competitive for decode.
  • iris one_shot wins at small decode sizes (M=1, 128) with up to 7.2× speedup.
  • iris ring is generally slower than RCCL for small M (decode), slightly faster for large M (prefill).
  • RCCL shows unusually high latency at M=2048–4096 (90+ ms) compared to M=8192 (16 ms), suggesting a tuning gap in RCCL at those sizes.

Note: These results are on MI300X (gfx942) with shared GPU access during CI. The PR description reports results on dedicated 8× MI355X hardware which may show different absolute numbers but similar relative trends.

Agent-Logs-Url: https://github.com/ROCm/iris/sessions/63e38ebd-8485-4528-85de-cba3d0fddaec

Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
@mawad-amd mawad-amd merged commit 9325422 into main Apr 16, 2026
@mawad-amd mawad-amd deleted the muhaawad/migrate-all-reduce-vllm branch April 16, 2026 05:51
mawad-amd added a commit that referenced this pull request Apr 16, 2026
Accidentally merged in #489 — these are installed by the Copilot
agent at runtime and should not be checked in.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mawad-amd added a commit that referenced this pull request Apr 16, 2026
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Labels

in-progress (We are working on it) · iris (Iris project issue)

3 participants